Testing for a difference in means of a single feature after clustering

For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we propose a new test for the difference in means in a single feature between a pair of clusters obtained using hierarchical or k-means clustering. The test based on the proposed p-value controls the selective Type I error rate in finite samples and can be efficiently computed. We further illustrate the validity and power of our proposal in simulation and demonstrate its use on single-cell RNA-sequencing data.


Introduction
Clustering algorithms, a collection of computational tools designed to group unlabelled data, are ubiquitously applied across fields to preprocess, visualize, and compress large data sets [15].It is often of interest to interpret and validate the results from clustering a data set: for instance, in the context of single-cell RNA sequencing (scRNA-seq), researchers often cluster cells based on their gene expression profiles, and want to interpret the resulting clusters as categorical measures of an unobserved aspect of the cells' biological state, such as cell type [1,11].Similarly, in market segmentation, an analyst might cluster customers according to their measurable characteristics such as age, gender, and spending habits, and subsequently assign each resulting cluster a descriptive label (e.g."outdoorsy customers") to inform market design [18].
Here, we consider how to determine which features are significantly different between two groups obtained via a clustering algorithm.Concretely, suppose that we applied a clustering algorithm to divide n observations into K non-overlapping groups based on q features.For a pair of groups and for a feature j ∈ {1, 2, . . ., q}, we want to answer the question: "How can we assess whether the population means of the jth feature are the same between the two groups?" Answering this question yields valuable insights.First, identifying the subset of features that appear to have different population means across cluster pairs facilitates cluster interpretation.
For instance, in scRNA-seq, if the data suggests that the two cell clusters have different population mean expression levels for known marker genes of specific cell subtypes (e.g., helper T cells and killer T cells), this supports interpreting these cell clusters as the corresponding subtypes.
Second, answers to this question could assist in evaluating the validity and generalizability of the obtained clusters.Given that clustering algorithms always output distinct clusters -even when applied to observations from a single population -observing at least one feature with population means across clusters increases our confidence in the resulting clusters, as well as the potential for generalizing our clustering results to new independent data sets.
To ascertain whether the population mean of each feature is the same between groups, applying a classical test for equality of means between two populations (e.g., the two-sample t-test) for each feature and cluster pair might seem intuitive.However, such an approach ignores the fact that the null hypothesis of equal population means of a given feature between two clusters depends on the data used for testing, since the clusters are estimated on the same data.This leads to a failure to control the selective Type I error rate [9]; that is, the probability of falsely rejecting the null hypothesis, given that we chose to test it.Furthermore, sample splitting does not provide an adequate solution in this context, as clustering a subset of the observations does not directly lead to cluster assignments for the remaining observations; detailed discussion is available in Chen and Witten [6], Gao et al. [10], Neufeld et al. [20].
In this paper, we develop a finite-sample selective inference framework [9] for testing for a difference in means of a single feature in two clusters, under a multivariate Gaussian assumption.
In short, to account for the fact that the clusters are estimated using the same data used for testing, we condition on the event that the clustering algorithm outputs a particular partition of the observations, thereby controlling the selective Type I error rate.In the special case of k-means clustering and hierarchical clustering -two of the most popular forms of clustering -we provide an analytical characterization of the conditioning set that enables efficient and exact computation of our proposed p-value.
Our work is closely related to Gao et al. [10] and Chen and Witten [6] and amounts to extending their selective inference framework for testing the difference in vector means to individual feature means.While this manuscript is under preparation, Hivert et al. [13] proposed a related selective inference framework for the difference between the mean of a single feature in two clusters.
Compared to their work, our proposal (i) does not assume that the features used for clustering are independent; and (ii) computes the p-value exactly with a computationally efficient algorithm, rather than approximating the p-value via Monte Carlo sampling.Methods developed in this paper are implemented in the R package CADET (Clustering And Differential Expression Testing) available at https://github.com/yiqunchen/CADET. Data and code for reproducing the results in this paper can be found at https://github.com/yiqunchen/CADET-experiments.
The rest of our paper is organized as follows.In Section 2, we review the problem of testing for a difference in means after clustering.In Sections 3 and 4, we propose tests that control the selective Type I error rate when testing for a difference in means after hierarchical or k-means clustering, and provide a computationally-efficient approach to compute the p-values corresponding to our proposal.We evaluate our proposal in a simulation study in Section 5 and apply our proposal to real datasets in Section 6. Proofs and additional results are relegated to the Appendix.
Throughout this paper, we will use the following notational conventions.I n , 0 n and 1{•} denote the n-dimensional identity matrix, n-vector of zeros, and the indicator function, respectively.For a matrix A, A i denotes the ith row and A ij denotes the (i, j)th entry.For a vector ν ∈ R n , ∥ν∥ 2 denotes its ℓ 2 norm, and Π ⊥ ν is the projection matrix onto the orthogonal complement of ν, i.e., 2 Motivation

Model and data
Let x ∈ R n×q be a data matrix with n observations of q features.For µ ∈ R n×q with unknown rows µ i ∈ R q and known, positive-definite Σ ∈ R q×q , we assume that x is a realization of a random matrix X, where rows of X are independent and drawn from a multivariate normal distribution: 2.2 Testing two pre-defined groups be the mean of the jth feature in the group G. Consider using x to test the null hypothesis that there is no difference in the mean of the jth feature across two pre-defined, non-overlapping groups G, G ′ ⊂ {1, 2, . . ., n}, i.e.
This is equivalent to testing , where Φ(•) is the cumulative distribution function of the standard normal distribution.

What changes when the groups are estimated clusters?
Let C(•) be a clustering algorithm that takes in a data matrix x with n rows and outputs a partition of {1, 2, . . ., n}.Suppose that we now want to use x to test the null hypothesis that there is no difference in the mean of the jth feature across two groups obtained by applying C(•) to x, i.e.

Ĥ0j : μ
where Ĝ, Ĝ′ ∈ C(x) are a pair of estimated clusters.This is equivalent to testing Ĥ0j : [µ T ν] j = 0 versus Ĥ1j : [µ T ν] j ̸ = 0, where ν is the n-vector with ith element given by The challenge is that Ĥ0j is a function of the data used to test it, because Ĝ and Ĝ′ are estimated clusters in C(x).We could naively treat Ĝ and Ĝ′ as pre-specified groups, and test Ĥ0j by applying the two-sample Z-test as described in Section 2.2.This would lead to the following p-value: That is, we could compare the values of x ij for i ∈ Ĝ ∪ Ĝ′ to the distribution of X ij for i ∈ Ĝ ∪ Ĝ′ .
However, because C(X) is random and dependent on X, the distribution of X ij stratified by C(X) can be far from the distribution of X ij stratified by C(x).Consequently, over repeated realizations of X, applying the two-sample Z-test to compare the means of two estimated clusters will lead to anti-conservative inference.
To illustrate this in an example, we simulate data from (1) with n = 150, q = 10, and with Σ ij = 1{i = j} + 0.4 • 1{i ̸ = j} so that we have two equally sized true clusters that differ in the first and last features only.In each simulated data set, we apply k-means clustering to obtain two clusters, and test for a difference in the means of the estimated clusters for each of the eight features.Across all of the simulated data sets, there is no difference in the means of the estimated clusters for features 2-8 under (7); see Figure 1(a).Nevertheless, there can be a substantial difference in the empirical means of the estimated clusters for features 2-8; see Figure 1(b).Thus, over 1,500 simulated data sets, and over features 2-8, the p-values from the two-sample Z-test appear far smaller than a Uniform(0, 1) distribution; by contrast, the p-values from our proposed test (Section 3) follow a Uniform(0, 1) distribution (Figure 1(c)).
3 Selective inference for the mean of a single feature We will overcome the challenges discussed in Section 2.3 by developing a selective inference framework [9] for testing the equality of the means of a single feature between two estimated clusters.(c): Quantile-quantile plot of the p-values from the two-sample Z-test applied to the estimated clusters (defined in ( 6)) and our proposal (defined in (10)), aggregated over 1,500 simulated data sets and over features 2-8 (i.e., the features with no true difference in means across any pairs of estimated clusters).

The "ideal" p-value
In the setting outlined in Section 2.3, we chose to test Ĥ0j in (4) using x because Ĝ, Ĝ′ ∈ C(x).
Thus, Fithian et al. [9] argues that we should apply a test that controls the selective Type I error rate at level α, which guarantees that the proportion of times we falsely reject a selected null hypothesis is controlled at level α over repeated realizations of X: This motivates the following conditional version of the two-sample Z-test in (6) to test Ĥ0j : where we conditioned on {C(X) = C(x)} because the hypothesis of interest was chosen based on the clustering output.By the probability integral transform, rejecting Ĥ0j if the p-value in ( 9) is less than α controls the selective Type I error rate at level α.
In practice, computing ( 9) is challenging, as (i) the conditional distribution of [X T ν] j depends on unknown parameters that are left unspecified by Ĥ0j ; and (ii) the conditioning set {X ∈ R n×q : C(X) = C(x)} depends on the clustering algorithm C and could be highly non-trivial to characterize.In Section 3.2, we will overcome these two challenges by modifying ( 9) to condition on extra information; this leads to a computationally tractable test that controls the selective Type I error rate when the clusters are obtained via hierarchical or k-means clustering.

Truncated Gaussian p-value
To overcome the challenges in computing ( 9), we condition on additional events and compute: where Compared to ( 9), we have conditioned on {U (X) = U (x)}.This choice does not sacrifice control of the selective Type I error rate (see Proposition 3 in Fithian et al. [9]).Furthermore, we can rewrite X in (1) as: where the first term is in the conditioning set of (10) and the second term depends on X only through our test statistic [X T ν] j .It turns out that the two terms on the right-hand-side of (12) are independent under model (1).Thus, when evaluating the conditional probability in (10), we only need to consider the randomness in X coming from the scalar-valued test statistic [X T ν] j , despite the fact that all of X is involved in the conditioning event {C(X) = C(x)}.
This intuition is formalized in the following result, which says that (i) computing ( 10) involves a truncated univariate normal distribution; and (ii) testing Ĥ0j using (10) controls the selective Type I error rate.
Then, for p j,selective defined in (10), we have that where Furthermore, the test that rejects Ĥ0j : μ Ĝj = μ Ĝ′ j when p j,selective ≤ α controls the selective type I error rate at level α, in the sense of (8).
It follows from Theorem 1 that computing the selective p-value in (10) amounts to computing the truncation set in (14).The next section is dedicated to understanding and computing this truncation set.
4 The truncation set

Intuition
The set Ŝj defined in ( 14) represents the values of ϕ for which the clustering algorithm C, when applied to x ′ (ϕ, j), yields the clustering output C(x).Here, T can be interpreted as a perturbation to the observed data x.When put together, panels (a)-(c) reveal that (i) ϕ can be interpreted as the observed "test statistic" x T j ν on x ′ (ϕ); (ii) varying ϕ only changes the values of clusters Ĝ, Ĝ′ and leaves the other clusters (e.g., the orange cluster in Figure 2) alone; and (iii) ϕ moves the observed difference in all features correlated with the feature being tested (e.g., feature 2 in Figure 2); we visualize the magnitude of the changes in Figure 2(d).In this case, the slope of the blue line is the correlation between the two features.x ′ (ϕ, j) and the slope of the blue line is Σ 12 /Σ 11 , where Σ is the covariance matrix of the features.

Computing Ŝ for hierarchical clustering
We first review an important result from Gao et al. [10].For any w ∈ R q , define the set Theorem 2 is a direct generalization of results in Section 3 of Gao et al. [10].In short, Section 3.2 of Gao et al. [10] shows that S(w) is the intersection of O(n 2 ) sets, where n is the number of observations.Sections 3.3 and 3.4 of Gao et al. [10] further reveal that for average, Ward, centroid, and single linkage, each of the O(n 2 ) intersected sets are the solution sets to a quadratic inequality in ϕ.Observing that we can take the intersection of the solution sets of N quadratic inequalities in O(N log N ) operations, and carefully analyzing the number of operations needed to compute the coefficients of the quadratic inequalities using the squared Euclidean distance matrix, leads to the worst-case time complexities listed in Theorem 2.
Since Ŝ in ( 14) can be written as Ŝ = S(ν, Σ j /Σ jj ); it follows from Theorem 2 that computing Ŝ for hierarchical clustering requires the worst-case time complexities listed in Theorem 2.

Extensions to k-means clustering
In this section, we extend the proposed p-value (10) to the k-means clustering algorithm and outline an efficient computational recipe.Because k-means clustering iteratively updates the cluster assignment [19], characterizing {X : C(X) = C(x)}, where C denotes the final clusters at convergence, may require enumerating possibly an exponential number of intermediate cluster assignments.Hence, following Chen and Witten [6], we condition on all of the intermediate clusters in the k-means clustering algorithm to arrive at this extension of the p-value in (10): i (•) is the assigned cluster of observation i at the tth iteration of the k-means algorithm, U (•) is defined in (11), and T is the total number of iterations run during k-means clustering.
As in Section 3, conditioning on additional information in (16) still guarantees selective type I error control, and p j,k-means can be computed using a univariate truncated Gaussian distribution; this is formalized in Proposition A1 in the Appendix.Regarding computation, we can extend the ideas in Section 3 of Chen and Witten [6] to efficiently compute the conditioning set in (16).The key idea is that we can recast the computation to solving O(nT K) number of quadratic inequalities in ϕ and intersecting the resulting solution sets, taking O((n+q)KT +nKT log(nKT )) operations in total (see details in the Appendix).
5 Simulation study

Overview
Throughout this section, we consider testing the null hypothesis Ĥ0j : μ Ĝj = μ Ĝ′ j versus Ĥ1j : μ Ĝj ̸ = μ Ĝ′ j , where, unless otherwise stated, Ĝ and Ĝ′ are a randomly-chosen pair of clusters from k-means or hierarchical clustering, and j is a randomly-chosen feature.We consider the following p-values: p j,naive in (6), p j,k-means in ( 16), p j,average , p j,centroid , and p j,single (defined in (10) where C is hierarchical clustering with average, centroid, and single linkage, respectively).
For each simulated data set, we apply k-means clustering and hierarchical clustering with average, centroid, and single linkage to estimate three clusters.We then compute p j,naive (based on the output from k-means clustering), p j,k-means , p j,average , p j,centroid , and p j,single for a randomlychosen pair of clusters and a random feature between 2 and q − 1.
Figure 3 displays the observed p-value quantiles versus the Uniform(0,1) quantiles.We see that for all values of q and ρ, (i) the naive p-values in (6) are stochastically smaller than a Uniform(0,1) random variable, suggesting that the test based on p j,naive leads to an inflated Type I error rate (the number of false rejections increases as the underlying feature correlation increases); (ii) tests based on p j,k-means , p j,average , p j,centroid , and p j,single control the selective Type I error rate in the sense of (8).
For each simulated data set, we computed p j,k-means , p j,average , p j,centroid , and p j,single for a randomly-chosen pair of clusters and rejected Ĥ0j : μ Ĝj = μ Ĝ′ j if these p-values were less than α = 0.05.Note that different clustering methods may estimate different clusters in a single data set, leading to different null hypotheses.Thus, our analysis evaluates both the conditional power of the tests and the detection probability of the employed clustering methods [6,10,14,16].We define the conditional power as the probability of rejecting Ĥ0j in (4) given that Ĝ and Ĝ′ are true clusters.Given M simulated data sets with true clusters {G 1 , . . ., G L }, we estimate it as: where p (m) and Ĝ(m) , Ĝ′ (m) denote the p-value and estimated clusters under consideration for the mth simulated data set.Because the quantity in (18) conditions on the event that Ĝ1 and Ĝ2 are true clusters, we also estimate how often that event occurs, i.e., the detection probability: Figures 4 displays the conditional power (18) for the tests based on p j,k-means , p j,average , p j,centroid , or p j,single .In cases where simulations did not recover the true clusters, we've conventionally set the conditional power to zero.Under model ( 1) with µ defined in (17), the conditional power increases as a function of the difference in feature means δ across all proposed p-values and feature correlation ρ.For a given q, a larger value of ρ leads to lower conditional power, especially for the test based on p j,k-means .Moreover, for a given value of δ and q, the ordering of power for different tests depends on the correlation between features: with independent features (left column of Figure 18), the test based p j,average and p j,k-means .By contrast, when features are highly correlated (right column of Figure 18), the tests based on p j,single and p j,k-means yield the highest and the lowest power, respectively.
The observed trends are congruent with the anticipated behaviour of individual clustering algorithms: For instance, k-means clustering, which uses within-cluster-sum-of-squares, tends to underperform when features are highly correlated.By contrast, single linkage hierarchical clustering, making use of the minimal distance between clusters, thrives in settings with high signal-to-noise ratio and high feature correlation.Figure 5 displays the relative performance of cluster recovery, characterized using detection probability (19).
6 Applications to scRNA-seq data In this section, we apply the proposed p-values to single-cell RNA-sequencing data collected by the Tabula Sapiens Consortium [7], which measures messenger RNA expression levels in each of 500,000 cells from 24 different tissues and organs.These data have enabled new insights into the distinct cell types within the human organism and created a detailed molecular definition of these cell types.To reveal biological insights on how gene expression levels change across cell types, biologists typically perform clustering on the cells, and then perform a differential expression analysis, i.e., they test for a difference in gene expression between two clusters [1,11].In this approach, ignoring the fact that the clusters were estimated from the same data used for testing,   One unique feature of the Consortium data set is that experts annotated cell types consistently across the different tissues.We will make use of the labelled cell types as the "ground truth" and use this information to demonstrate that our proposed p-values in Section 3 yield biologically reasonable results.As per standard pre-processing techniques [8], we first excluded cells with low numbers or total counts of expressed genes, as well as cells in which a large percentage of the expressed genes are mitochondrial.We then normalized the transcripts for each cell by the total sum of counts in that cell, followed by a log 2 (x + 1) transformation.
We applied the pre-processing pipeline separately on two sets of cells collected from the same donor: the CD4-positive, alpha-beta T cells, and a combined sample of four cellular types from an identical donor, namely memory B cells, natural killer cells, macrophages, and monocytes.
We considered only the subset of 500 genes with the largest sample variance in expression levels post-normalization.
To investigate the selective type I error rate in the absence of true clusters, we first consider a "no cluster" data set consisting of only CD4-positive, alpha-beta T cells after pre-processing (thus, n = 833 and q = 500).We applied k-means clustering with K = 4 to obtain four estimated clusters.For each pair of estimated clusters and each feature j = 1, . . ., 500, we computed the p-values pj,naive and pj,k-means (where the p emphasizes that we used the sample covariance matrix as an estimate of Σ in (1)).The quantile-quantile plot of the resulting p-values is displayed in Figure 6(a).We display the number of rejected hypotheses after FDR correction using the BH procedure [3] against the nominal FDR level in Figure 6(b).In this data set, the naive p-values are extremely small for all pairs of estimated clusters, leading to hundreds of rejected null hypotheses of equal means, while our proposed p-values are quite large and lead to virtually no rejections after FDR correction.In particular, at α = 0.05 and FDR level of 0.20, the test based on pj,naive would conclude that more than 60% genes are "differentially expressed", whereas our approach would suggest that expression levels across clusters are the same for most genes.Because this "no cluster" data set consists only of a single type of expert-annotated cell from a single donor, we believe the conclusion based on pj,k-means aligns better with the underlying biology.
Next, we turn our attention to the "cluster" data set, which encompasses memory B cells, natural killer cells, macrophages, and monocytes.We applied k-means clustering to obtain four clusters, and subsequently estimated a covariance matrix based on the residuals from the k-means fit.Notably, the clusters derived in this manner align closely with the four distinct cell types, with an adjusted Rand Index between the cell type annotations and estimated cluster memberships exceeding 0.6.We then computed the p-values pj,naive and pj,k-means across all features and for each pair of the estimated clusters.The quantiles of these p-values, as well as the number of null hypotheses rejected following FDR adjustment, are depicted in panels (a) and (b) of Figure 7, respectively.Notably, both sets of p-values on this data set are quite small, and the BH procedure results in a comparable count of rejections for both sets of p-values.Given that the clusters in this context largely correspond to distinct cell types, our results suggest that the test employing our proposed p-value has reasonable power to reject the null hypothesis when it does not hold.

Discussion
In this work, we proposed a test for a difference in means for a single feature between two clusters estimated from hierarchical or k-means clustering, under (1).Here, we outline several future research directions.
The p-values introduced in Section 3 can be extended to test for a difference in means between groups of features.For instance, if we want to test for equality in means for all features j ∈ J between two estimated clusters, i.e., Ĥ0J : j∈J μ Ĝj = μ Ĝ′ j .Following the argument in this paper, the following p-value 2 Σ JJ controls the selective Type I error rate under (1) and can be efficiently computed.Here, [x T ν] J ∈ R |J| represents the vector subset with indices in J; Σ J ∈ R q×|J| is the submatrix in Σ with columns in J; and Σ JJ ∈ R |J|×|J| is the submatrix in Σ with row and column indices in J.
Furthermore, our p-values can be used to derive selective confidence intervals for a difference in means for feature j [9,17].Under our setup, computing the confidence intervals amounts to a root-finding problem, which can be efficiently solved using bisection [4,5].This extension would enhance data uncertainty evaluation: for instance, both p-values and confidence intervals on differing gene expression profiles are used to guide downstream scientific inquiries.
Thus, to show (22), it suffices to show that the two terms in the sum are independent of [ XT ν] j ; i.e., and We will start by showing (23).Observe that To show (24), we will need to state and prove an intermediate result.
Proof.Observe that Thus, it follows from properties of multivariate normal distributions that z 1 and z 2 − Σ 21 Σ −1 11 z 1 are independent.Since X T ν ∼ N q (µ T ν, ∥ν∥ 2  2 Σ), it follows from Lemma 1 that where [X T ν] −j denotes the result of removing the jth component from X T ν ∈ R q , and Σ j,−j denotes the result of removing the jth component from Σ j ∈ R q .Finally, to complete the proof of (24), note that by algebra, the jth entry of the vector Σ jj [X T ν] j is always 0. This implies that any random variable that is independent of the q − 1 subvector [X 25) is also independent of the full vector; this completes the proof.
We have now shown ( 22).Thus, we can apply (22) to simplify (21) as: where the second equality follows from the definition of Ŝj in equation ( 14) of the main text as . Thus, defining ϕ ∼ N (0, ∥ν∥ 2 2 Σ jj ), we can rewrite (26) as: where F(t; µ, σ, S) denotes the cumulative distribution function (CDF) of a N (µ, σ 2 ) random variable truncated to the set S. This is equation (13) in the main text.
We will now prove that rejecting Ĥ0j based on p j,selective ≤ α controls the selective type I error rate in the sense of (8).First, recall that we decided to test Ĥ0j : [µ T ν] j = 0 because C(X) = C(x).
Thus, we need to show: Define the function Then, observe that the following holds for any α ∈ (0, 1): where the first equality follows from (27), and the second equality follows from the probability integral transform theorem.Therefore, we have that where the first equality follows from the tower property of conditional expectation, and the second equality follows from (30).That is, equation (28) holds with equality.
B Additional details for the extensions to k-means clustering in Section 4.3 In this section, we provide more details on the extensions of proposals in Section 3 of the main text to k-means clustering.

B.1 A brief overview of the k-means algorithm
We briefly review the k-means clustering algorithm in this section.For a set of samples x 1 , . . ., x n ∈ R q and a positive integer K, k-means clustering partitions the n samples into disjoint subsets Ĉ1 , . . ., ĈK by solving the following optimization problem [2,19]: It is not typically possible to solve for the global optimum in (31) [2]; one of the most popular approaches is Lloyd's algorithm [19], given in Algorithm 1.In Lloyd's algorithm, we first sample K out of n observations as initial centroids (step 1 in Algorithm 1), followed by assigning each observation to its closest centroid (step 2).Next, we iterate between steps 1 and 2 until the cluster assignments stop changing, and the algorithm is guaranteed to converge to a local optimum [12].

B.2 Characterization and efficient computation of p j,k-means
Theorem 3 below is a direct extension of Theorem 1 (stated in Section 3.2 of the main text and proven in Appendix A) to the case of k-means clustering.Recalling U Theorem 3. Suppose that x is a realization from (1), i.e., each row x i is drawn independently from N q (µ i , Σ), i = 1, 2, . . ., n, and F(t; µ, σ, S) denotes the cumulative distribution function (CDF) of a N (µ, σ2 ) random variable truncated to the set S. If Ĥ0j : μ Ĝj = μ Ĝ′ j holds for a given j ∈ {1, 2, . . ., q}, then we have that p j,k-means = 1 − F [ν T x] j ; 0, Σ jj ∥ν∥ 2 2 ; S j x; T t=1 C (t) (x) + F − [ν T x] j ; 0, Σ jj ∥ν∥ 2 2 ; S j x; T t=1 C (t) (x) . (32) Here, C (t) (x) = (c n ) is the estimated cluster at the tth iteration of the k-means algorithm, Algorithm 1: Lloyd's algorithm for k-means clustering [19] Input: Data x 1 , . . ., x n ∈ R q , number of output clusters K, maximum iteration T , random seed s.
K ) by sampling K observations from x 1 , . . ., x n without replacement, using the random seed s.It follows from Theorem 3 that computing the p-value p j,k-means reduces to characterizing the set S j x; T t=1 C (t) (x) in (33).It turns out we could directly leverage the results in Chen and Witten [6] (reproduced below as Lemma 1), which tests for a difference in the means for the entire vector, to arrive at a computationally-efficient recipe.
Corollary 1 (Chen and Witten (2023)).For any w ∈ R q , define the set S w; Suppose that we apply the k-means clustering algorithm (Algorithm 1) to a matrix x ∈ R n×q , to obtain K clusters in at most T steps.Then, the set S w; T t=1 C (t) (x) defined in (34) can be computed in O nKT (n + q) + nKT log(nKT ) operations.
Proof.Corollary 1 follows from the observation Proposition 5 in Chen and Witten [6] proves the special case of S w; T t=1 C (t) (x) with w = ν T x/∥ν T x∥ 2 .However, no special properties of the vector νT x/∥ν T x∥ 2 was used in the proof, and replacing νT x/∥ν T x∥ 2 with Σ j /Σ jj everywhere in their proof yields the desired result.

C Additional simulation results
In Section 5.3, we compared the conditional power of the tests based on p j,k-means , p j,average , p j,centroid , and p j,single under model (1).
Here, we consider a different notion of power that does not condition on the estimated clusters Ĝ and Ĝ′ being true clusters.In this case, comparing the power of the tests requires a bit of

Figure 1 :
Figure 1: We simulated one data set from (1) with µ and Σ specified in (7).(a) Empirical distribution of feature 2 based on the simulated data set.(b): We apply k-means clustering to obtain two clusters and plot the empirical distribution of feature 2 stratified by the clusters.

Figure 2
Figure 2 illustrates a realization of (1) with n = 30, q = 2, and a covariance matrix Σ encoding moderate correlation (0.4) between any two features.Panel (a) displays the observed data x, which corresponds to x ′ (ϕ) with ϕ = x T 1 ν = −3.Here, ν was chosen to test the difference between Ĝ (shown in blue) and Ĝ′ (shown in rosy brown), estimated from k-means clustering with K = 3. Panels (b) and (c) of Figure 2 display x ′ (ϕ) with ϕ = 0 and ϕ = 6, respectively.In panel (b), with ϕ = 0, the blue and rosy brown clusters are "pushed together" in the first feature, resulting in x ′ (ϕ) T 1 ν = 0; that is, there is no difference in empirical means between feature 1 (x-axis of panel (b)) of the two clusters under consideration.By contrast, in panel (c), with ϕ = −5, the blue and rosy brown clusters are "pulled apart", which results in an increased distance between the first feature of the blue and rosy brown clusters.

Figure 2 :
Figure 2: One simulated data set generated from model (1) with µ i = 1{1 ≤ i ≤ 10} [0, 2.5] T + 5, −2.5] T and Σ = 0.2 • [1, 0.4; 0.4, 1].(a): The original data x corresponds to ϕ = −3.Applying k-means clustering with K = 3 yields three clusters (rosy brown, blue, and orange).Here, ν is chosen to test for a difference in means between Ĝ (blue) and Ĝ′ (rosy brown).Empirical means for features 1 and 2 are displayed in dashed lines for G and G ′ .(b): Perturbed data x ′ (ϕ, 1) at ϕ = 0 results in no empirical mean difference for the first feature between the blue and rosy brown clusters.(c): With x ′ (ϕ, 1) at ϕ = −5, the mean difference for the first feature becomes more pronounced.(d): The empirical difference in features 1 (red line) and 2 (blue line) as a function of ϕ.The slope of the red line is 1 by the definition of

) Theorem 2 (
Gao et al. (2023)).Let K > 1 and consider applying hierarchical clustering to the squared Euclidean distance matrix and cutting the resulting dendrogram to get K clusters.Then, for any realization x from (1), and any w ∈ R q , the set S(w) can be computed in at most O(n 2 +n log(n)) operations for single and average linkage, O(n 3 +n log(n)) operations for centroid linkage, and O(n 2 + n log(n)) operations for Ward linkage.

Figure 6 :
Figure 6: (a): Quantile-quantile plot of the p-values pj,naive and pj,k-means aggregated over features j = 1, . . ., 500 and all pairs of estimated clusters on the "no cluster" data set.(b): Number of rejected null hypotheses at different nominal FDR levels using BH-procedure-adjusted p-values from (a).

Figure 7 :
Figure 7: (a): Quantile-quantile plot of the p-values pj,naive and pj,k-means aggregated over features j = 1, . . ., 500 and all pairs of estimated clusters on the "cluster" data set.(b): Number of rejected null hypotheses at different nominal FDR levels using BH-adjusted p-values from (a).

C
(t) (x) = ϕ ∈ R : T t=1 C(t) x + (ϕ − (x Ĝj − x Ĝ′ j )Ĥ0j whenever p k-means,j ≤ α controls the selective type I error rate at level α.Proof.The proof of Theorem 3 follows directly from the proof of Theorem 1 in Section 1 of the Appendix by replacing the clustering output C(•) with T t=1 C (t) (•).